Abstract

Unlike static documents, versioned documents are continuously edited by one or more authors. Such a
collaborative revision process makes traditional modeling and visualization techniques inappropriate. In
this paper we propose a new representation based on local space-time smoothing that captures important
revision patterns. We demonstrate the applicability of our framework using experiments on synthetic
and real-world data.
1 Introduction

Most computational linguistics studies concentrate on modeling or analyzing documents as sequences of
words. In this paper we consider modeling and visualizing versioned documents, that is, the authoring
process leading to the final word sequence. In particular, we focus on documents whose authoring process
naturally segments into consecutive versions. The revisions, as the differences between consecutive versions
are often called, may be authored by a single author or by multiple authors working collaboratively.

One popular way to keep track of versioned documents is to use a version control system such as CVS
or Subversion (SVN). This is often the case with books or with large computer code projects. In other
cases, more specialized computational infrastructure may be available, as is the case with the authoring
APIs of Wikipedia.org, Slashdot.com, and Google Wave. Accessing such an API provides information about
what each revision contains, when it was submitted, and who edited it. In any case, we formally consider
a versioned document as a sequence of documents $d_1, \ldots, d_l$ indexed by their revision number, where
$d_i$ typically contains some locally concentrated additions or deletions as compared to $d_{i-1}$.

In this paper we develop a continuous representation of versioned documents that generalizes the
locally weighted bag of words representation [7]. The representation smooths the sequence of versioned
documents across two axes: time $t$ and space $s$. The time axis $t$ represents the revision and the space
axis $s$ represents document position. The smoothing results in a continuous map from a space-time domain
$\Omega \subset \mathbb{R}^2$ to the simplex of term frequency vectors

$$\gamma : \Omega \to \mathbb{P}_V \quad \text{where} \quad \mathbb{P}_V = \Big\{ x \in \mathbb{R}^{|V|} : x_w \ge 0, \ \sum_{w \in V} x_w = 1 \Big\}. \tag{1}$$

The mapping above ($V$ is the vocabulary) captures the variation in the local distribution of word content
across time and space. Thus $[\gamma(s,t)]_w$ is the (smoothed) probability of observing word $w$ in space $s$
(document position) and time $t$ (version). Geometrically, $\gamma$ realizes a divergence-free vector field (since
$\sum_w [\gamma(s,t)]_w = 1$, $\gamma$ has zero divergence) over the space-time domain $\Omega$.

We consider the following four versioned document analysis tasks. The first task is visualizing
word-content changes with respect to space (how quickly the document changes its content), time (how much
the current version differs from the previous one), or mixed space-time. The second task is detecting
sharp transitions or edges in the word content across the space-time domain $\Omega$. The third task is
concerned with segmenting the space-time domain into a finite partition reflecting word content. The fourth
task is predicting future revision operations.

Our main tool in addressing tasks 1-4 above is to analyze the values of the vector field $\gamma$ and its
derivative fields

$$\nabla\gamma = (\dot\gamma_s, \dot\gamma_t) \tag{2}$$

where $\dot\gamma$ and $\ddot\gamma$ represent the first and second order partial derivatives.

We proceed with a detailed description of our representation and the four tasks. We then describe
experiments on synthetic, Wikipedia, and Google Wave data. We conclude with related work and a discussion.

2 Space-Time Smoothing for Versioned Documents

With no loss of generality we identify the vocabulary $V$ with positive integers $\{1, \ldots, V\}$ and represent a
word $w \in V$ by a unit vector (all zero except for 1 at the component corresponding to $w$)

$$e(w) = (0, \ldots, 0, 1, 0, \ldots, 0)^\top, \qquad w \in V. \tag{3}$$

We extend this definition to word sequences, thus representing documents $(w_1, \ldots, w_N)$ ($w_i \in V$) as
sequences of $V$-dimensional vectors $(e(w_1), \ldots, e(w_N))$. Similarly, a versioned document is a sequence of
documents $d^{(1)}, \ldots, d^{(l)}$ of potentially different lengths, $d^{(j)} = (w^{(j)}_1, \ldots, w^{(j)}_{N(j)})$. Using (3) we represent
such a versioned document as a ragged array of unit vectors

$$\begin{pmatrix} e(w^{(1)}_1) & \cdots & e(w^{(1)}_{N(1)}) \\ \vdots & & \vdots \\ e(w^{(l)}_1) & \cdots & e(w^{(l)}_{N(l)}) \end{pmatrix} \tag{4}$$
Note the slight abuse of notation as $V$ represents both a set of words and an integer: $V = \{1, \ldots, V\}$ with $V = |V|$.
Figure 1: Four space-time representations of a simple synthetic versioned document over $V = \{1,2\}$ (see
text for more details). The left panel displays the first component of (4) (non-smoothed array of unit vectors
corresponding to words). The second and third panels display $[\gamma(s,t)]_1$ for the non-normalized and
normalized representations respectively. The fourth panel displays the gradient vector field
$(\dot\gamma_s(s,t), \dot\gamma_t(s,t))$ (contour levels represent the gradient magnitude). The black portions of the first
two panels correspond to zero padding due to unequal lengths of the different versions.

where columns correspond to space (document position) and rows to time (versions). 
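
As a concrete illustration of the ragged array of unit vectors, the following minimal Python sketch builds it for a toy versioned document. The function names `unit_vector` and `ragged_array` and the toy data are illustrative assumptions and are not part of the original framework.

```python
def unit_vector(w, vocab_size):
    """Return e(w): all zeros except a 1 at the component for word w (1-based)."""
    e = [0] * vocab_size
    e[w - 1] = 1
    return e


def ragged_array(versions, vocab_size):
    """Map each version (a list of word ids) to a row of unit vectors.

    Rows index time (version) and columns index space (document position).
    Rows may have different lengths, hence the array is ragged.
    """
    return [[unit_vector(w, vocab_size) for w in doc] for doc in versions]


# Toy versioned document over V = {1, 2} with three versions (hypothetical data).
versions = [[1, 2, 2], [1, 1, 2, 2], [2, 1]]
E = ragged_array(versions, vocab_size=2)
```

Here `E[j][i]` is the unit vector of the word at position `i` of version `j`, so no information about the word sequence is lost.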

The ragged array (4) of high dimensional vectors represents the versioned document without any loss of
information. Nevertheless, the high dimensionality of $V$ suggests that we smooth the vectors in (4) with
neighboring vectors in order to better capture the local word content. Specifically, we convolve each
component of (4) with a 2-D smoothing kernel $K_h$ [12], e.g.,

$$K_h(x,y) \propto \exp\left(-(x^2+y^2)/(2h^2)\right), \tag{5}$$

to obtain a smooth vector field $\gamma$ over space-time

$$\gamma(s,t) = \sum_{s'} \sum_{t'} K_h(s - s', t - t')\, e\big(w^{(t')}_{s'}\big). \tag{6}$$

Thus as $(s,t)$ vary over a continuous domain $\Omega \subset \mathbb{R}^2$, $\gamma(s,t)$, which is a weighted combination of
neighboring unit vectors, traces a continuous surface in $\mathbb{R}^V$. Assuming that the kernel $K_h$ is a normalized
density it can be shown that $\gamma(s,t)$ is a non-negative normalized vector, i.e., $\gamma(s,t) \in \mathbb{P}_V$ (see (1) for a
definition of $\mathbb{P}_V$), measuring the local distribution of words around the space-time location $(s,t)$. It thus
extends the concept of lowbow (locally weighted bag of words) introduced in [7] from single documents to
versioned documents.
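
The smoothing step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: positions and versions are taken to lie on an integer grid, the Gaussian kernel is used, and the kernel weights are renormalized over the positions that actually exist so that the discrete sum lands on the simplex. The names `gaussian_kernel` and `smooth` and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, y, h):
    """Unnormalized 2-D Gaussian kernel exp(-(x^2 + y^2) / (2 h^2))."""
    return math.exp(-(x * x + y * y) / (2.0 * h * h))


def smooth(versions, vocab_size, s, t, h=1.0):
    """Locally weighted word distribution gamma(s, t).

    versions[t'][s'] is the word id (1-based) at position s' of version t'.
    Weights are renormalized over existing positions (an assumption of this
    sketch), so the returned vector is non-negative and sums to one.
    """
    gamma = [0.0] * vocab_size
    total = 0.0
    for t_prime, doc in enumerate(versions):
        for s_prime, w in enumerate(doc):
            k = gaussian_kernel(s - s_prime, t - t_prime, h)
            gamma[w - 1] += k  # adds k * e(w) without building e(w)
            total += k
    return [g / total for g in gamma]


# Toy data: word 1 dominates the start of each version, word 2 the end.
versions = [[1, 1, 2, 2], [1, 1, 1, 2, 2], [1, 2, 2, 2]]
left = smooth(versions, 2, s=0.0, t=1.0)
right = smooth(versions, 2, s=4.0, t=1.0)
```

Evaluating `smooth` at non-integer `(s, t)` is also possible, which is what makes the representation continuous over the space-time domain.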

One difficulty with the above scheme is that the document versions $d_1, \ldots, d_l$ may be of different
lengths. We consider two ways to resolve this issue. The first pads shorter document versions with zero
vectors as needed. We refer to the resulting representation $\gamma$ as the non-normalized representation. The
second approach normalizes all the document versions to a common length, say $\prod_{j=1}^{l} N(j)$, where $N(j)$ is
the length of version $j$. That is, each word in the first document is expanded into $\prod_{j \neq 1} N(j)$ words, each
word in the second document is expanded into $\prod_{j \neq 2} N(j)$ words, etc. We refer to the resulting
representation $\gamma$ as the normalized representation.
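
The length normalization can be illustrated with a short Python sketch. It follows the description above literally by stretching every version to the product of all version lengths. The function name `normalize_lengths` and the toy input are assumptions made for illustration.

```python
import math


def normalize_lengths(versions):
    """Stretch every version to the common length prod_j N(j).

    Each word of version i is repeated prod_{j != i} N(j) times, so relative
    positions are preserved while all rows end up with equal length.
    """
    common = math.prod(len(doc) for doc in versions)
    stretched = []
    for doc in versions:
        repeat = common // len(doc)  # equals the product of the other lengths
        stretched.append([w for w in doc for _ in range(repeat)])
    return stretched


normalized = normalize_lengths([[1, 2], [1, 2, 2]])
```

The product grows quickly with the number of versions. A practical implementation would likely resample each version to a fixed number of positions instead, which is a design choice outside the description above.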

The non-normalized representation has the advantage of conveying absolute lengths. For example, it
makes it possible to track how different portions of the document grow or shrink (in terms of number of
words) with the version number. The normalized representation has the advantage of conveying lengths
relative to the document length. For example, it makes it possible to track how different portions of the
document grow or shrink with the version number relative to the total document length. In either case, the
space-time domain $\Omega$ on which $\gamma$ is defined (1) is a two dimensional rectangular domain $\Omega = [0, I] \times [0, J]$.

Before proceeding to examine how $\gamma$ may be used in the four tasks described in Section 1, we demonstrate
our framework with a simple low dimensional example. Assuming a vocabulary of two words $V = \{1,2\}$,
we can visualize $\gamma$ by displaying its first component as a grayscale image (since $[\gamma(s,t)]_2 = 1 - [\gamma(s,t)]_1$
the second component is redundant). Specifically, we created a versioned document with three contiguous
segments whose $\{1,2\}$ words were sampled from a Bernoulli distribution with parameters 0.3 (first
segment), 0.7 (second segment), and 0.5 (third segment). That is, the probability of getting 1 is highest for
the second segment, equal to the probability of getting 2 for the third segment, and lowest for the first
segment. The initial lengths of the segments were 30, 40 and 120 words, with the first segment increasing
and the third segment decreasing at half the rate of the first segment. The length of the second segment was
constant across the different versions. Figure 1 displays the non-smoothed ragged array (4) (left), the
non-normalized $[\gamma(s,t)]_1$ (middle left) and the normalized $[\gamma(s,t)]_1$ (middle right).

While the left panel does not distinguish much between the second and third segments, the two smoothed
representations display a clear segmentation of the space-time domain into three segments. The non-normalized
representation (middle left) makes it easy to see that the total length of the versioned document is increasing,
but it is not easy to judge what happens to the relative sizes of the three segments. The normalized
representation (middle right) makes it easy to see that the first segment increases in size, the second is
constant, and the third decreases in size. It is also possible to notice that the growth rate of the first
segment is higher than the decay rate of the third segment.
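
A synthetic versioned document of this kind can be generated with the following Python sketch. The segment probabilities and initial lengths come from the text. The number of versions, the growth rate, the random seed, and the decision to resample every version independently are assumptions made only for illustration.

```python
import random


def synthetic_versions(n_versions=20, growth=2, seed=0):
    """Generate versions with three contiguous segments over V = {1, 2}.

    Segment 1 grows by `growth` words per version, segment 2 stays constant,
    and segment 3 shrinks at half that rate. Words are resampled for every
    version, which is a simplification of a real revision process.
    """
    rng = random.Random(seed)
    probs = (0.3, 0.7, 0.5)  # P(word == 1) in segments 1, 2, 3
    versions = []
    for t in range(n_versions):
        lengths = (30 + growth * t, 40, 120 - (growth * t) // 2)
        doc = []
        for p, n in zip(probs, lengths):
            doc.extend(1 if rng.random() < p else 2 for _ in range(n))
        versions.append(doc)
    return versions


versions = synthetic_versions()
```

Because the first segment grows faster than the third one shrinks, the total length increases with the version number, matching the behavior described above.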

3 Visualizing Change in Space-Time 

We apply the space-time representation to four tasks. The first task, visualizing change, is described in this
section. The remaining three tasks are described in the next three sections.

The space-time domain $\Omega$ represents the union of all document versions and all document positions.
Some parts of $\Omega$ are more homogeneous and some are less in terms of their local word distribution. Locations
in $\Omega$ where the local word distribution substantially diverges from its neighbors correspond to sharp content
transitions. On the other hand, locations in $\Omega$ whose word distribution is more or less constant correspond
to slow content variation.

We distinguish between three different types of changes. The first occurs when the word content changes
substantially between neighboring document positions within a certain document version. As an example
consider a document location whose content shifts from high level introductory motivation to a detailed
technical description. The amount of such change across space is

$$\|\dot\gamma_s(s,t)\|^2 = \sum_{w=1}^{V} \left( \frac{\partial [\gamma(s,t)]_w}{\partial s} \right)^2. \tag{7}$$

Maxima points in such change may be detected by examining values of

$$\|\ddot\gamma_s(s,t)\|^2 = \sum_{w=1}^{V} \left( \frac{\partial^2 [\gamma(s,t)]_w}{\partial s^2} \right)^2 \tag{8}$$

that are close to zero.
Figure 2: Gradient and edges for a portion of the versioned Wikipedia Religion article. The left panel
displays $\|\dot\gamma_s(s,t)\|^2$ (amount of change across document locations for different versions). The second panel
displays $\|\dot\gamma_t(s,t)\|^2$ (amount of change across versions for different document positions). The third panel
displays the local maxima of $\|\dot\gamma_s(s,t)\|^2 + \|\dot\gamma_t(s,t)\|^2$, which correspond to potential edges, either vertical
lines (section and subsection boundaries) or horizontal lines (between substantial revisions). The fourth
panel displays boundaries of sections and subsections as black and gray lines respectively.



A second type of change occurs when a certain document position undergoes substantial change in local
word distribution across neighboring versions. An example is erroneous content in one version being heavily
revised in the next version. Such change along the time axis corresponds to the magnitude of

$$\|\dot\gamma_t(s,t)\|^2 = \sum_{w=1}^{V} \left( \frac{\partial [\gamma(s,t)]_w}{\partial t} \right)^2 \tag{9}$$

with maxima corresponding to values of

$$\|\ddot\gamma_t(s,t)\|^2 = \sum_{w=1}^{V} \left( \frac{\partial^2 [\gamma(s,t)]_w}{\partial t^2} \right)^2 \tag{10}$$

that are close to zero.
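
On a discretized space-time grid the squared norms of the first derivatives can be approximated by finite differences. The sketch below assumes a rectangular grid `gamma[t][s]` of probability vectors with unit spacing and at least two points per axis. It uses central differences in the interior and one-sided differences at the border. The function names and the toy grid are hypothetical.

```python
def sq_norm_of_difference(u, v, spacing):
    """Squared Euclidean norm of (u - v) / spacing, summed over words."""
    return sum(((a - b) / spacing) ** 2 for a, b in zip(u, v))


def spatial_change(gamma, t, s):
    """Finite-difference estimate of ||d gamma / d s||^2 at grid point (s, t)."""
    row = gamma[t]
    lo, hi = max(s - 1, 0), min(s + 1, len(row) - 1)
    return sq_norm_of_difference(row[hi], row[lo], hi - lo)


def temporal_change(gamma, t, s):
    """Finite-difference estimate of ||d gamma / d t||^2 at grid point (s, t)."""
    lo, hi = max(t - 1, 0), min(t + 1, len(gamma) - 1)
    return sq_norm_of_difference(gamma[hi][s], gamma[lo][s], hi - lo)


# Toy grid: two identical versions with a sharp content change in the middle.
gamma = [[[1, 0], [1, 0], [0, 1], [0, 1]] for _ in range(2)]
```

In this toy grid the spatial change is largest next to the content boundary, while the temporal change vanishes because the two versions are identical.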

Expressions (7)-(10) may be used to measure the instantaneous rate of change in the local word
distribution. Alternatively, integrating (7)-(10) provides a global measure of change

$$h(s) = \int \|\dot\gamma_s(s,t)\|^2 \, dt, \qquad g(t) = \int \|\dot\gamma_t(s,t)\|^2 \, ds,$$

with $h(s)$ describing the total amount of spatial change across all revisions and $g(t)$ describing the total
amount of version change across different document positions. $h(s)$ may be used to detect document
regions undergoing repeated substantial content revisions and $g(t)$ may be used to detect revisions in which
substantial content changes have been made across the entire document.
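
On a grid with unit spacing the two integrals reduce to column and row sums. The following minimal sketch assumes that `gs[t][s]` and `gt[t][s]` already hold the squared norms of the spatial and temporal derivatives on a rectangular grid. The function names and the toy values are illustrative assumptions.

```python
def total_spatial_change(gs):
    """h(s): sum over versions t of the squared spatial derivative norm."""
    n_positions = len(gs[0])
    return [sum(row[s] for row in gs) for s in range(n_positions)]


def total_version_change(gt):
    """g(t): sum over positions s of the squared temporal derivative norm."""
    return [sum(row) for row in gt]


# Toy values: a persistent boundary at position 1 and one heavy revision.
gs = [[0.0, 0.5, 0.0], [0.0, 0.5, 0.0]]
gt = [[0.0, 0.0, 0.0], [0.25, 0.25, 0.25]]
h = total_spatial_change(gs)
g = total_version_change(gt)
```

Large entries of `h` point to document positions that repeatedly sit on a content boundary, and large entries of `g` point to revisions that changed content across the whole document.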
We conclude with the integrated directional derivative

$$\int_0^1 \|\dot\alpha_s(r)\,\dot\gamma_s(\alpha(r)) + \dot\alpha_t(r)\,\dot\gamma_t(\alpha(r))\|^2 \, dr \tag{11}$$

where $\alpha : [0,1] \to \Omega$ is a parameterized curve in space-time and $\dot\alpha$ its tangent vector. Expression (11)
may be used to measure change along a dynamically moving document anchor such as the boundary between
two book chapters. The space coordinate of such an anchor shifts with the version number (due to the addition
and removal of content across versions) and so integrating the gradient along one of the two axes, as in
$h(s)$ or $g(t)$, is not appropriate. Defining $\alpha(r)$ to be a parameterized curve in space-time realizing the anchor
positions $(s,t) \in \Omega$ across multiple revisions, (11) measures the amount of change at the anchor point.
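
By the chain rule, the integrand of (11) is the squared norm of the derivative of $\gamma(\alpha(r))$ with respect to $r$, so the integral can be approximated from samples of $\gamma$ along the anchor curve. The sketch below assumes the curve is sampled at equally spaced values of $r$ in $[0,1]$. The function `change_along_curve`, the toy field `toy_gamma`, and the two anchor curves are hypothetical.

```python
def change_along_curve(gamma_at, curve):
    """Approximate (11) from samples of gamma along a curve.

    gamma_at(s, t) returns a probability vector, and curve is a list of
    (s, t) points taken at equally spaced r in [0, 1].
    """
    n_steps = len(curve) - 1
    dr = 1.0 / n_steps
    total = 0.0
    for (s0, t0), (s1, t1) in zip(curve, curve[1:]):
        g0, g1 = gamma_at(s0, t0), gamma_at(s1, t1)
        # squared norm of the finite-difference derivative, times the step
        total += sum(((b - a) / dr) ** 2 for a, b in zip(g0, g1)) * dr
    return total


def toy_gamma(s, t):
    """Toy field whose content boundary drifts diagonally in space-time."""
    p = 0.5 + 0.1 * (s - t)
    return [p, 1.0 - p]


anchor_moving = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # follows the drift
anchor_fixed = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]   # fixed position
moving_change = change_along_curve(toy_gamma, anchor_moving)
fixed_change = change_along_curve(toy_gamma, anchor_fixed)
```

The anchor that moves with the content sees no change, whereas a fixed document position registers change that is only due to the shifting boundary. This is the reason for integrating along the anchor curve rather than along a single axis.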

3.1 Experiments 

The right panel of Figure 1 shows the gradient vector field corresponding to the synthetic versioned
document described in the previous section. As expected, it tends to be orthogonal to the segment boundaries.
Its magnitude is displayed by the contour lines, which show the highest magnitudes around segment boundaries.

Figure 2 shows the norm $\|\dot\gamma_s(s,t)\|^2$ (left), $\|\dot\gamma_t(s,t)\|^2$ (middle left) and the local maxima of
$\|\dot\gamma_s(s,t)\|^2 + \|\dot\gamma_t(s,t)\|^2$ (middle right) for a portion of the versioned Wikipedia Religion article. The first
panel shows the amount of change in local word distribution within documents. High values correspond to
boundaries between sections, topics or other document segments. The second panel shows the amount of change
as one version is replaced with another. It shows which revisions change the word distributions substantially
and which result in a relatively minor change. The third panel shows only the local maxima, which correspond
to edges between topics or segments (vertical lines) or revisions (horizontal lines).

4 Edge Detection 

In many cases documents may be divided into semantically coherent segments. Examples of text segments
include individual news stories in streaming broadcast news transcription, sections in articles or books, and
individual messages in a discussion board or an email trail. For non-versioned documents finding the text
segments is equivalent to finding the boundaries or edges between consecutive segments. See 1H [TJ |9l for 
several recent studies in this area. 

Things get a bit more complicated in the case of versioned documents. Segments, and their boundaries,
exist in each version. As in the case of image processing, we may view segment boundaries as edges in the
space-time domain $\Omega$. These boundaries separate the segments from each other, much like borders separate
countries in a two dimensional geographical map.

Assuming all edges are correctly identified, we can easily identify the segments as the interior points
of the closed boundaries. In general, however, attempts to identify segment boundaries or edges will only
be partially successful. As a result, in practice, predicted edges are not closed and do not lead to interior
segments in a straightforward manner. We consider in this section the task of predicting segment boundaries
or edges in $\Omega$. In the next section we examine the task of predicting a segmentation of the versioned
document.

Edges, or transitions between segments, correspond to abrupt changes in the local word distribution. As
such we characterize them as points in $\Omega$ having high gradient value. In particular, we distinguish between
vertical edges (transitions across document positions), horizontal edges (transitions across versions), and
diagonal edges (transitions across both document position and versions, as is the case with anchor points
described above). These three types of edges may be diagnosed based on the magnitudes of $\dot\gamma_s$, $\dot\gamma_t$, and
$\dot\alpha_1\dot\gamma_s + \dot\alpha_2\dot\gamma_t$ respectively.
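
A simple way to turn a gradient magnitude map into candidate edges is to keep grid cells that are local maxima. The sketch below is one possible realization and not necessarily the procedure used in the experiments: it assumes a grid `mag[t][s]` of magnitudes, a 4-neighborhood, and a hypothetical `threshold` parameter that suppresses flat regions.

```python
def local_maxima(mag, threshold=0.0):
    """Return (t, s) cells above threshold that are >= all of their 4-neighbors."""
    edges = []
    for t, row in enumerate(mag):
        for s, value in enumerate(row):
            if value <= threshold:
                continue
            candidates = ((t - 1, s), (t + 1, s), (t, s - 1), (t, s + 1))
            neighbors = [
                mag[tt][ss]
                for tt, ss in candidates
                if 0 <= tt < len(mag) and 0 <= ss < len(mag[tt])
            ]
            if all(value >= n for n in neighbors):
                edges.append((t, s))
    return edges


# Toy magnitude map with a vertical edge at position 2 in every version.
mag = [[0.0, 0.1, 0.5, 0.1] for _ in range(3)]
edges = local_maxima(mag)
```

For the toy map the detected cells line up vertically, which corresponds to a transition across document positions that persists over versions.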

4.1 Experiments 

Besides the synthetic data results in Figure 1, we conducted edge detection experiments on five different
real world datasets. Four datasets are Wikipedia.org articles: Religion, Atlanta, Language, and European Union.
Figure 3: Gradient and edges of a portion of the versioned Atlanta Wikipedia article (top row) and the
Google Wave Amazon Kindle FAQ (bottom row). The left column displays the magnitude of the gradient
in both space and time, $\|\dot\gamma_s(s,t)\|^2 + \|\dot\gamma_t(s,t)\|^2$. The middle column displays the local maxima of the
gradient magnitude (left column). The right column displays the actual segment boundaries as vertical lines
(section headings for Wikipedia and author change in Google Wave). The gradient maxima corresponding to
vertical lines in the middle column closely match the Wikipedia section boundaries. The gradient maxima
corresponding to horizontal lines in the middle column closely match major revisions, indicated by
discontinuities in the location of the section boundaries.
Article             Revisions   Voc. Size   p(y)     Accuracy                F1 Measure
                                                     a       b       c       a       b       c
Religion            2000        2880        0.404    0.596   0.568   0.641   0.0     0.470   0.577
Atlanta             2000        3078        0.401    0.599   0.575   0.666   0.0     0.467   0.539
Language            2000        3727        0.292    0.706   0.550   0.698   0.0     0.379   0.140
European Union      2000        2382        0.534    0.526   0.456   0.560   0.626   0.397   0.657
Amazon Kindle FAQ   100         573         0.339    0.656   0.480   0.683   0.0     0.440   0.540


Figure 4: Test-set accuracy and F1 measure for edge prediction (section boundaries in Wikipedia articles
and author change in Google Wave). The space-time domain was divided into a grid with each cell labeled
edge ($y = 1$) or no edge ($y = 0$) depending on whether it contained any edges. Method a corresponds to a
predictor that always selects the majority class. Method b corresponds to the TextTiling text segmentation
algorithm [6] without paragraph boundary information. Method c corresponds to a logistic regression
classifier whose feature set is composed of statistical summaries (mean, median, max, min) of $\|\dot\gamma_s(s,t)\|$
within the grid cell in question as well as neighboring cells.

Religion and European Union are versioned documents with relatively frequent updates, while Atlanta and
Language have less frequent changes. The fifth dataset is the Google Wave Amazon Kindle FAQ, which is a
versioned document with less structure than the Wikipedia articles.

Preprocessing included removing HTML tags and pictures, word stemming, stop-word removal, and
removing any non-alphabetic characters (numbers and punctuation). The section heading information of
Wikipedia and the author information of each posting in Google Wave are used as ground truth for
segment boundaries. This information was separated from the dataset and was used for training (on the
training set) or for evaluation (on the test set).

Figures 2 and 3 display gradient information, local maxima, and ground truth segment boundaries for the
versioned Wikipedia articles Religion and Atlanta. The local gradient maxima nicely match the segment
boundaries, which led us to consider training a logistic regression classifier on a feature set composed
of gradient value statistics (min, max, mean, and median of $\|\dot\gamma_s(s,t)\|$) in the appropriate location as well
as its neighbors (the space-time domain $\Omega$ was divided into a finite grid where each cell either contained an
edge ($y = 1$) or did not ($y = 0$)). The table in Figure 4 displays the test set accuracy and F1 measure
of three predictors: our logistic regression (method c) as well as two baselines: predicting edge/no-edge
based on which label is most probable (method a) and using TextTiling (method b) [6], which is a popular
text segmentation algorithm. Since we do not assume paragraph information in our experiment we ignored
this component and considered the document as a sequence with $w = 20$ and 29 minimum depth gaps
parameters (see [6]). We conclude from the figure that the gradient information leads to better accuracy
than TextTiling on all five datasets and to a better F1 measure on four of the five (Language being the
exception).
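
The per-cell features fed to the classifier can be sketched as follows. This is only an illustration of the statistical summaries named above. The cell boundaries, the function name `cell_features`, and the toy magnitude grid are assumptions, and the classifier itself is omitted.

```python
import statistics


def cell_features(mag, t0, t1, s0, s1):
    """Summaries of gradient magnitudes inside one grid cell.

    The cell covers versions t0..t1-1 and positions s0..s1-1 of mag[t][s].
    Returns [mean, median, max, min].
    """
    values = [mag[t][s] for t in range(t0, t1) for s in range(s0, s1)]
    return [
        statistics.mean(values),
        statistics.median(values),
        max(values),
        min(values),
    ]


# Toy grid of gradient magnitudes treated as a single cell.
mag = [[0.0, 1.0], [2.0, 5.0]]
features = cell_features(mag, 0, 2, 0, 2)
```

Concatenating the feature vectors of a cell and of its neighboring cells gives one training example, with the label indicating whether the cell contains a ground truth edge.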

5 Segmentation 

As mentioned in the previous section, predicting edges may not result in closed boundaries. It is possible
to analyze the location and direction of the predicted edges and aggregate them into a sequence of closed
boundaries surrounding the segments. We take a different approach and partition points in $\Omega$ into $k$ distinct
segments based on the local word content and space-time proximity.

For two points $(s_1,t_1), (s_2,t_2) \in \Omega$ to be in the same segment we expect $\gamma(s_1,t_1)$ to be similar to
$\gamma(s_2,t_2)$ and $(s_1,t_1)$ to be close to $(s_2,t_2)$. The first condition asserts that the two locations discuss the
Figure 5: Predicted segmentation (top) and ground truth segment boundaries (bottom) of portions of the
versioned Wikipedia articles Religion (left), Atlanta (middle) and the Google Wave Amazon Kindle FAQ (right).
The predicted segments match the ground truth segment boundaries. Note that the first 100 revisions are used
in the Google Wave result. The proportion of the segments that appeared in the beginning keeps decreasing
as the number of revisions increases and new segments appear.
Article          Revisions   Voc. Size   p(y)     Accuracy                F1 Measure
                                                  a       b       c       a       b       c
Religion         2000        2880        0.123    0.878   0.878   0.878   0.0     0.216   0.217
Atlanta          2000        3078        0.218    0.780   0.780   0.781   0.0     0.360   0.360
Language         2000        3727        0.189    0.811   0.812   0.810   0.0     0.317   0.318
European Union   2000        2382        0.213    0.786   0.787   0.787   0.0     0.352   0.352

Figure 6: Accuracy and F1 measure over a held out test set for predicting future UNDO operations in Wikipedia
articles. Method a corresponds to a predictor that always selects the majority class. Method b corresponds
to an SVM classifier based on the term frequency vector of the current version. Method c corresponds to an SVM
classifier that uses summaries (mean, median, max, min) of $\|\dot\gamma_s(s,t)\|$, $\|\dot\gamma_t(s,t)\|$, $g(t)$, and $h(s)$.



same topic. The second condition asserts that the two locations are not too far from each other in the
space-time domain. More specifically, we propose to segment $\Omega$ by clustering its points based on the following
geometry

$$d((s_1,t_1),(s_2,t_2)) = d_H(\gamma(s_1,t_1), \gamma(s_2,t_2)) + \sqrt{c_1 (s_1 - s_2)^2 + c_2 (t_1 - t_2)^2} \tag{12}$$

where $d_H : \mathbb{P}_V \times \mathbb{P}_V \to \mathbb{R}$ is the Hellinger distance

$$d_H^2(u,v) = \sum_{i=1}^{V} \left( \sqrt{u_i} - \sqrt{v_i} \right)^2. \tag{13}$$

The weights $c_1, c_2$ are used to balance the contributions of the word content similarity with the similarity in
time and space.
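
The clustering geometry can be written down directly in Python. The sketch assumes that the space and time offsets enter through the square root of a weighted sum of squares, which is our reading of (12). The default weights `c1 = c2 = 1` and the function names are illustrative assumptions.

```python
import math


def hellinger(u, v):
    """Hellinger distance between two probability vectors, following (13)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(u, v)))


def space_time_distance(p1, g1, p2, g2, c1=1.0, c2=1.0):
    """Distance between space-time points p1 = (s1, t1) and p2 = (s2, t2).

    g1 and g2 are the smoothed word distributions at the two points.
    The offset term is an assumed form: sqrt(c1 * ds^2 + c2 * dt^2).
    """
    (s1, t1), (s2, t2) = p1, p2
    offset = math.sqrt(c1 * (s1 - s2) ** 2 + c2 * (t1 - t2) ** 2)
    return hellinger(g1, g2) + offset


same_content = space_time_distance((0, 0), [0.5, 0.5], (3, 4), [0.5, 0.5])
```

Two points with identical word distributions are separated only by their space-time offset, while two nearby points with disjoint word content are separated by the maximal Hellinger distance of $\sqrt{2}$.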



5.1 Experiments 

Figure 5 displays the ground truth segment boundaries and the segmentation results obtained by applying
k-means clustering ($k = 11$) with the metric (12). This qualitative result shows that the predicted segments
match the actual edges in the documents even though no edge or gradient information was used in the
segmentation process.



6 Predicting Future Operations 

The fourth and final task is predicting a future revision $d_{l+1}$ based on the smoothed representation of
the present and past versions $d_1, \ldots, d_l$. In terms of $\Omega$, this means predicting features associated with
$\gamma(s,t)$, $t \ge t'$ based on $\gamma(s,t)$, $t < t'$.



6.1 Experiments 

We concentrate on predicting whether Wikipedia versions are reversed in the next revision. Such a reversal 
is marked using a label UNDO or REVERT in the Wikipedia API. These operations are used for preventing 
abuse or removing immature content. 

We predict whether a version will undergo UNDO in the next version using a support vector machine
based on statistical summaries (mean, median, min, max) of the following feature set: $\|\dot\gamma_s(s,t)\|$,
$\|\ddot\gamma_s(s,t)\|$, $\|\dot\gamma_t(s,t)\|$, $\|\ddot\gamma_t(s,t)\|$, $g(t)$, and $h(s)$. Figure 6 shows the test set accuracy and F1 measure
for the SVM classifier based on the smoothed space-time representation (method c), as well as two baselines.
The first baseline (method a) predicts the majority class and the second baseline (method b) is a support
vector machine based on the term frequency content of the current test version. Using the derivatives of $\gamma$ we
obtain a prediction that is comparable to (slightly better than) the term frequency or word content SVM. We
thus conclude that the derivatives above provide useful information for the prediction of future operations,
achieving accuracy similar to that of the word content. This is particularly remarkable since the derivatives
feature set is substantially smaller than the high dimensional tf vector.
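
The F1 values of 0.0 reported for the majority-class baseline follow directly from the definitions of the two measures, as the short sketch below illustrates. The function `accuracy_and_f1`, the convention of returning an F1 of 0.0 when it is undefined, and the toy labels are assumptions made for illustration.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 measure for binary labels with 1 as the positive class."""
    tp = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 1)
    fp = sum(1 for a, b in zip(y_true, y_pred) if a == 0 and b == 1)
    fn = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 0)
    accuracy = sum(1 for a, b in zip(y_true, y_pred) if a == b) / len(y_true)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy labels: 2 of 10 versions are followed by UNDO, the baseline predicts 0.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
majority = [0] * len(y_true)
accuracy, f1 = accuracy_and_f1(y_true, majority)
```

A predictor that never outputs the positive class has no true positives, so its F1 measure is zero even when its accuracy is high. This is why the F1 measure is the more informative column when the positive class is rare.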

7 Related Work 

While document analysis is a very active research area, there has been relatively little work on examining
versioned documents. Our approach is the first to consider versioned documents as continuous mappings
from a space-time domain to the space of local word distributions. It extends the ideas in [7] of using
kernel smoothing to create a continuous representation of documents. In fact, our framework is more general
as it reduces to [7] when there is only a single revision.

Other approaches to sequential analysis of documents concentrate on discrete spaces and discrete mod- 
els, with the possible extension of |[T3l . Related papers on segmentation and sequential document analysis 
are (6j [U with O being the closest in spirit to our approach. An influential model for topic model- 
ing within and across documents is latent Dirichlet allocation (3JI3. Our approach differs in being fully 
non-parametric and in that it does not require iterative parametric estimation or integration. The paper Q 
contains a statistical interpretation of local word smoothing as a parameter estimator. This interpretation 
may be extended to our paper in a straightforward manner. 

Several attempts have been made to visualize themes and topics in documents, either by keeping track of 
the word distribution or by dimensionality reduction techniques e.g., lH|5j[T0l[Ill. Such studies, however, 
tend to visualize a corpus of unrelated documents as opposed to versioned documents that are naturally 
ordered. 

8 Summary and Discussion 

The task of analyzing and visualizing versioned documents is an important one. It allows external control
and monitoring of collaboratively authored resources such as Wikipedia, Google Wave, and CVS or SVN
documents. Our framework is the first to develop analysis and visualization tools in this setting. It presents a
new representation for versioned documents that uses local smoothing to map a space-time domain $\Omega$ to the
simplex of tf vectors $\mathbb{P}_V$. We demonstrate the applicability of the representation for four tasks: visualizing
change, predicting edges, segmentation, and predicting future revision operations. Experiments conducted
on synthetic, Wikipedia and Google Wave articles show that it achieves good performance both qualitatively
and quantitatively (as compared to baseline predictors).

It is intriguing to consider the similarity between our representation and image processing. Predicting
segment boundaries is similar to edge detection in images. Segmenting versioned documents may be
reduced to image segmentation. Predicting future operations is similar to completing image parts based on
the remaining pixels and a statistical model. Due to its long and successful history, image processing is
a good candidate for providing useful tools for versioned document analysis. Our framework facilitates
this analogy and, we believe, is likely to result in novel models and analysis tools inspired by current image
processing paradigms. A few potential examples are wavelet filtering, image compression, and statistical
models such as Markov random fields.